[Feature] Semantics support and remote atomic add #48

Open
tzj-fxz wants to merge 28 commits into main from tzj-dev

Conversation


@tzj-fxz tzj-fxz commented Feb 4, 2026

  • Add semantics and scope support for wait operation
  • Add remote atomic add operation
  • Add example of manually routing alltoall intranode

WIP

  • Consistent scope identifiers for memory operation
  • Optimize routing examples

Summary by CodeRabbit

  • New Features

    • Added intra‑node distributed all‑to‑all examples (including 2D torus XY routing) with verification, CLI launch, and optional benchmarking; added remote atomic‑add operation.
  • API Changes

    • Wait and fence APIs now accept explicit memory scope and semantic parameters; legacy single‑arg wait variant removed.
    • Barrier API gains a need_fence flag to control automatic fencing.
  • Refactor

    • Replaced string‑based scope/semantic parameters with enum types for clearer, safer calls.
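The string-to-enum refactor can be pictured with a small standalone sketch. The enum names and integer values follow the unified mapping described later in this review (scope {0: cta, 1: cluster, 2: gpu, 3: sys}; semantic {0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}), but the wait shim below is illustrative only, not the actual tilelang API:

```python
from enum import IntEnum

class MemoryScope(IntEnum):
    CTA = 0
    CLUSTER = 1
    GPU = 2
    SYSTEM = 3

class MemorySemantic(IntEnum):
    WEAK = 0
    VOLATILE = 1
    RELAXED = 2
    ACQUIRE = 3
    RELEASE = 4
    ACQ_REL = 5

def wait(barrier, expected, scope=MemoryScope.SYSTEM, semantic=MemorySemantic.ACQUIRE):
    # Unlike raw "sys"/"acquire" strings, a typo'd enum member fails at
    # attribute lookup; the numeric values feed the lowered intrinsic.
    return ("tl.wait", barrier, expected, int(scope), int(semantic))

call = wait("bar0", 1, scope=MemoryScope.GPU, semantic=MemorySemantic.RELAXED)
print(call)  # ('tl.wait', 'bar0', 1, 2, 2)
```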


coderabbitai bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds enum-based memory scopes/semantics, expands wait/fence/atomic primitives to be scope/semantic-aware (including cross-PE waits), introduces a remote atomic-add TileLang operator with PTX lowering, updates CUDA templates/codegen and wait/fence signatures, and adds two intra-node all‑to‑all example scripts.

Changes

Cohort / File(s): Summary

  • Distributed all‑to‑all examples (examples/distributed/intranode/example_alltoall.py, examples/distributed/intranode/example_alltoall_route2x4.py): Add two intra-node all‑to‑all example scripts, a tiled alltoall and a torus XY‑routing multi‑phase alltoall, with TileLang kernels, per-rank launchers, multiprocessing spawn, PyTorch reference verification, and optional benchmarking/timing.
  • Remote atomic operations (src/op/remote_copy.h, src/op/remote_copy.cc): Add AtomAddRemoteOpNode/AtomAddRemoteOp types, registration, and reflection; implement lowering to PTX atom-add naming that encodes semantic/scope; update semantic/scope mappings and bounds checks used during lowering.
  • Synchronization core & operators (src/op/sync.h, src/op/sync.cc): Remove the wait_eq builtin; add scope and semantic fields to WaitOpNode; change WaitOp registration arity (4→6) and BarrierBlocksOp arity (1→2); parse/validate extra args and forward scope/semantic in lowering.
  • CUDA templates (atomics / ldst / sync) (src/tl_templates/cuda/atomic.h, src/tl_templates/cuda/ldst.h, src/tl_templates/cuda/sync.h): Expand the Semantic and Scope enums (add RELAXED, ACQ_REL, CLUSTER); change PTX atomic helpers to accept raw addresses; introduce SyncScope/SyncSemantic enums and semantic-aware memory_fence_/ld/wait helpers; expose wait/memory_fence APIs with semantic parameters.
  • CUDA codegen (src/target/codegen_cuda.cc): Remove the special-case emission for wait_eq; forward the fence semantic argument when emitting tl::memory_fence_* calls (include the first arg in the emitted call).
  • Language layer / Python API (tilelang/language/utils.py, tilelang/language/builtin.py, tilelang/language/distributed/common.py): Add MemoryScope and MemorySemantic enums; update builtin wrappers (ld/st/atom_add/fence) to accept enums and pass numeric values; add atom_add_remote(); extend distributed wait_* helpers to accept peer/scope/semantic and forward them to tl.wait.
  • Transform update (src/transform/warp_specialized_rewriter.cc): Replace the wait_eq() check with WaitOp::Get() in warp role promotion logic to reflect the WaitOp representation change.

Sequence Diagram(s)

sequenceDiagram
  participant Launcher as Launcher (per-rank)
  participant TLKernel as TileLang Kernel
  participant CUDA as CUDA/PTX
  participant RemotePE as Remote PE memory
  participant Verifier as PyTorch verifier

  Launcher->>TLKernel: allocate buffers, init barriers/signals
  Launcher->>TLKernel: launch kernel (alltoall / torus)
  TLKernel->>CUDA: perform tiled copies and routing phases
  TLKernel->>CUDA: call atom_add_remote(addr,val,sem,scope)
  CUDA->>RemotePE: execute tl::ptx_atom_add_<sem>_<scope>(addr,val)
  RemotePE-->>CUDA: memory updated / ack
  TLKernel->>TLKernel: wait/barrier with scope/semantic
  TLKernel-->>Launcher: kernel completes
  Launcher->>Verifier: run PyTorch all_to_all_single and compare results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • chengyupku

Poem

"I'm a hopping rabbit, coding speedy threads,
I stamp semantics where synchronization treads.
Atoms leap remote and fences hum in tune,
All‑to‑all parcels dance beneath the moon.
Hooray — data routed, now I munch a rune!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title '[Feature] Semantics support and remote atomic add' accurately summarizes the main changes, which include adding semantics/scope support for wait operations, implementing remote atomic add operations, and providing examples.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Feb 4, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@tzj-fxz tzj-fxz marked this pull request as ready for review February 5, 2026 08:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/tl_templates/cuda/atomic.h (1)

244-281: ⚠️ Potential issue | 🟠 Major

Replace unsigned long with uintptr_t for device address parameters.

unsigned long is 32‑bit on some platforms (Windows, 32‑bit systems), risking silent truncation of 64‑bit device addresses. Use uintptr_t to maintain pointer‑width correctness across all platforms. This applies to all four sys-scope atomic functions: ptx_atom_add_relaxed_sys, ptx_atom_add_acquire_sys, ptx_atom_add_release_sys, and ptx_atom_add_acq_rel_sys.

🔧 Suggested fix
+#include <cstdint>
 ...
-TL_DEVICE uint32_t ptx_atom_add_relaxed_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_relaxed_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.relaxed.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_acquire_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_acquire_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.acquire.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_release_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_release_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.release.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
 
-TL_DEVICE uint32_t ptx_atom_add_acq_rel_sys(unsigned long addr,
+TL_DEVICE uint32_t ptx_atom_add_acq_rel_sys(uintptr_t addr,
                                             uint32_t value) {
   uint32_t ret;
   const uint32_t *ptr = reinterpret_cast<const uint32_t *>(addr);
   asm volatile("atom.add.acq_rel.sys.global.u32 %0, [%1], %2;\n"
                : "=r"(ret)
                : "l"(ptr), "r"(value));
   return ret;
 }
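For context, the width mismatch behind this comment can be observed from Python with ctypes: on LP64 Linux both types are 8 bytes, while on LLP64 64-bit Windows unsigned long is 4 bytes even though pointers are 8.

```python
import ctypes

# c_void_p always matches the platform pointer width (what uintptr_t guarantees);
# c_ulong follows the data model and shrinks to 4 bytes under LLP64 (64-bit Windows).
print("unsigned long:", ctypes.sizeof(ctypes.c_ulong), "bytes")
print("pointer:      ", ctypes.sizeof(ctypes.c_void_p), "bytes")

# A pointer stored in unsigned long is silently truncated whenever this is False:
fits = ctypes.sizeof(ctypes.c_ulong) >= ctypes.sizeof(ctypes.c_void_p)
print("pointers fit in unsigned long:", fits)
```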
src/op/sync.h (1)

129-171: ⚠️ Potential issue | 🔴 Critical

Add need_fence to SEqualReduce, SHashReduce, and RegisterReflection.

The need_fence field (line 135) is used in the Lower method to affect code generation but is omitted from equality, hashing, and reflection checks. This causes two BarrierBlocksOpNode instances with different need_fence values to compare as equal and hash identically, creating a correctness issue for CSE and other IR-based optimizations.

Required changes

Add need_fence to:

  • RegisterReflection: .def_ro("need_fence", &BarrierBlocksOpNode::need_fence)
  • SEqualReduce: Include equal(need_fence, other->need_fence) in the return statement
  • SHashReduce: Add hash_reduce(need_fence);
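The hazard can be demonstrated with a small Python stand-in (this is not the TVM reflection API; dataclasses' compare=False plays the role of the omitted SEqualReduce/SHashReduce entries):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BarrierBlocksOp:
    barrier_id: int
    # compare=False reproduces the bug: need_fence is left out of __eq__/__hash__
    need_fence: bool = field(default=True, compare=False)

a = BarrierBlocksOp(0, need_fence=True)
b = BarrierBlocksOp(0, need_fence=False)
assert a == b              # different codegen behavior, yet structurally "equal"
assert hash(a) == hash(b)  # a CSE pass keyed on eq/hash would merge the two ops
```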
🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Line 42: The kernel signature currently includes an unused parameter barrier;
update the kernel or its invocation to remove the unused parameter or use it for
synchronization: either delete barrier from the kernel signature and also stop
allocating/passing it in run_torus_alltoall, or add explicit barrier
synchronization inside the kernel (referencing the kernel function where
barrier: T.Tensor((PE_num), "int32") is declared) and ensure run_torus_alltoall
continues to allocate and pass the barrier tensor so the symbol names match
(update both the kernel definition and the call sites in run_torus_alltoall).
- Around line 279-288: The script does not validate that X * Y equals PE_num
before spawning processes, which can cause incorrect torus topology; after
parsing args in the main block (the argparse code that sets args.M, args.N,
args.PE_num, args.X, args.Y) add a check that args.X * args.Y == args.PE_num and
raise a clear error (ValueError or argparse.ArgumentTypeError) or call
parser.error with a descriptive message if the equality fails so
run_torus_alltoall is only invoked with a consistent torus configuration.

In `@examples/distributed/intranode/example_alltoall.py`:
- Around line 73-75: The bandwidth formula is computing elements/sec but not
bytes/sec; update the print calculation to multiply by the element size (in
bytes) — e.g., use dtype_itemsize = np.dtype(tensor_dtype).itemsize or hard-code
2 for float16 — then compute total_bytes = 2 * PE_num * M * N * dtype_itemsize
and compute bandwidth_gbps = total_bytes / (elapsed_time * 1e6) (use the same
elapsed_time units as the printed ms) and print bandwidth_gbps; update the
expression where local_rank, PE_num, M, N, elapsed_time are used to include this
element-size factor.
- Around line 12-37: The kernel declares a barrier parameter that is never used
in main; either remove it from the kernel signature (and corresponding callers)
or implement inter-rank synchronization using that barrier instead of relying
solely on T.fence_sys. If you choose to use the barrier, after T.put_block and
T.fence_sys update the barrier tensor to signal this rank’s completion and then
wait/spin until all PE_num entries are set (i.e., implement a full all-ranks
barrier using the barrier buffer), referencing the symbols main, barrier,
T.put_block and T.fence_sys; if you choose removal, delete the barrier parameter
from main and all call sites.

In `@src/op/remote_copy.cc`:
- Around line 422-431: The code constructs a function name using sem_str[sem]
and scope_str[scope] without validating indices, which can lead to buffer
overreads; add bounds checks for sem (<6) and scope (<4) before indexing
(similar to StOpNode::Lower and LdOpNode::Lower), and handle out-of-range values
by selecting a safe default string or emitting an error/log and returning early
so the use in building "tl::ptx_atom_add_<sem>_<scope>" cannot index past
sem_str or scope_str.

In `@src/op/remote_copy.h`:
- Around line 327-328: Update the field comments for the "scope" and "sem"
members to match the actual enum/mapping used in the implementation: describe
scope as "{0: cta, 1: cluster, 2: gpu, 3: sys}" and describe sem as "{0: weak,
1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}" so the
documentation for the int members scope and sem in remote_copy.h matches the
implemented semantics.

In `@tilelang/language/builtin.py`:
- Around line 839-854: The is_remote check in the function incorrectly treats a
plain Python int dst_pe == -1 as remote because it only checks
isinstance(dst_pe, tir.IntImm); update the logic in the is_remote computation
(the variable named is_remote) to also consider a plain int -1 as local—e.g.,
treat dst_pe == -1 the same as tir.IntImm with value == -1 or normalize dst_pe
via getattr(dst_pe, "value", dst_pe) before comparison—so the remote branch (the
code that builds local_rank, local_base_ptr, offset_to_base, remote_ptr and
calls tir.call_extern with remote_ptr) is only taken when dst_pe truly indicates
a remote PE.
- Around line 737-738: Update the incorrect docstring on the function atom_add:
replace the copy-pasted PTX cp.async barrier description with a concise
description stating that atom_add performs an atomic add operation (e.g.,
performs an atomic addition of a value to a target and returns the previous
value or the new value per implementation), mention relevant semantics such as
data type(s) supported and whether the operation is relaxed/ordered or maps to a
specific PTX atomic instruction, and ensure the docstring matches the function
signature and behavior in atom_add.
- Around line 733-747: The atom_add function currently uses direct dict lookups
for scope_str and sem_str which will raise KeyError for unsupported MemoryScope
or MemorySemantic values; update atom_add to explicitly validate inputs before
using them (check the provided scope against allowed values {MemoryScope.GPU,
MemoryScope.SYSTEM} and sem against {MemorySemantic.RELAXED,
MemorySemantic.ACQUIRE, MemorySemantic.RELEASE, MemorySemantic.ACQ_REL}), and if
invalid raise a clear ValueError mentioning the parameter name, the unsupported
value, and the list of supported options; then proceed to build
scope_str/sem_str and call tir.call_intrin(address_of(barrier), value, sem_str,
scope_str) as before.
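The dst_pe normalization suggested for is_remote above can be sketched without TVM; FakeIntImm below is a hypothetical stand-in for any tir.IntImm-like object carrying its constant in .value:

```python
def is_remote_pe(dst_pe):
    # Unwrap IntImm-style wrappers; plain ints pass through unchanged.
    value = getattr(dst_pe, "value", dst_pe)
    # Only a concrete -1 means "local"; symbolic expressions stay remote.
    return not (isinstance(value, int) and value == -1)

class FakeIntImm:
    def __init__(self, value):
        self.value = value

assert is_remote_pe(-1) is False              # plain Python int: local
assert is_remote_pe(FakeIntImm(-1)) is False  # wrapped immediate: local
assert is_remote_pe(3) is True
assert is_remote_pe(FakeIntImm(0)) is True
```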
🧹 Nitpick comments (10)
src/op/remote_copy.cc (1)

276-281: Consider extracting duplicated string arrays to shared constants.

The sem_str and scope_str arrays are duplicated in StOpNode::Lower, LdOpNode::Lower, and AtomAddRemoteOpNode::Lower. This creates a maintenance burden — if the enum mappings change, all three locations must be updated.

Also applies to: 350-355, 424-428

examples/distributed/intranode/example_alltoall.py (1)

83-83: Consider using English comments for consistency.

The comment # 对比结果 (meaning "compare results") is in Chinese. For broader accessibility and consistency, consider using English comments.

✏️ Suggested change
-    # 对比结果
+    # Compare results
examples/distributed/intranode/example_alltoall_route2x4.py (2)

27-27: Unused function argument block_N.

The block_N parameter is passed to the function but never used. Consider removing it if it's not needed.


230-231: Large intermediate buffer allocation - consider documenting memory requirements.

The buffer_transfer tensor has shape (PE_num, PE_num, M, N). With default parameters (PE_num=8, M=8192, N=7168, float16), this requires approximately 7.5 GB per rank. Combined with src, dst, and other tensors, total memory per GPU could exceed 15 GB.

Consider adding a comment or documentation noting the memory requirements, or adding a check to warn users if the allocation might exceed available GPU memory.
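A quick back-of-the-envelope check of the figures quoted above (float16 assumed at 2 bytes per element, default shapes from the script):

```python
PE_num, M, N = 8, 8192, 7168
itemsize = 2  # sizeof(float16)

buffer_transfer = PE_num * PE_num * M * N * itemsize  # shape (PE_num, PE_num, M, N)
src_dst = 2 * PE_num * M * N * itemsize               # src and dst tensors

print(f"buffer_transfer: {buffer_transfer / 1e9:.2f} GB")  # 7.52 GB
print(f"with src + dst:  {(buffer_transfer + src_dst) / 1e9:.2f} GB")
```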

src/tl_templates/cuda/ldst.h (1)

5-7: CLUSTER scope has no ld/st implementations.

Scope::CLUSTER is now public, but there are no TL_ST_IMPL/TL_LD_IMPL entries for it, so any use will fall into the unsupported static_assert. If CLUSTER isn’t intended for tl::ld/st yet, consider an explicit guard to make the limitation clearer; if it is intended, add the PTX mappings.

src/tl_templates/cuda/sync.h (1)

239-323: CTA/CLUSTER scopes and non‑acquire semantics collapse to GPU acquire.

ld_wait_generic treats any non‑SYSTEM scope as GPU, and ld_wait_{gpu,sys} only distinguish RELAXED/VOLATILE; all other semantics fall back to acquire. If CTA/CLUSTER or WEAK/RELEASE/ACQ_REL/SC are intended to be distinct, add explicit cases or assert on unsupported values to avoid silent mismatch.

tilelang/language/distributed/common.py (2)

126-131: Type hint for peer parameter is inaccurate.

The type hint peer: PrimExpr | None = -1 is misleading—the default -1 is an int, not None. This pattern repeats across all wait_* functions (lines 144-148, 162-166, 180-184, 198-202, 216-220).

Proposed fix for wait_eq (apply similar changes to other wait_* functions)
 def wait_eq(barrier: PrimExpr,
             expected: PrimExpr,
-            peer: PrimExpr | None = -1,
+            peer: PrimExpr | int = -1,
             scope: MemoryScope = MemoryScope.SYSTEM,
             semantic: MemorySemantic = MemorySemantic.ACQUIRE):

131-132: Address TODO comment about semantic differences.

The TODO at line 132 indicates there are semantic differences between wait_eq and the other wait functions that need clarification. This should be addressed to ensure proper documentation.

Would you like me to help draft documentation clarifying the semantic differences between wait_eq and other wait functions, or open an issue to track this?

tilelang/language/builtin.py (2)

827-833: Duplicate scope/sem mapping logic—consider consolidating.

This scope/semantic mapping (lines 827-833) duplicates the code in atom_add (lines 739-745). Both also lack validation for unsupported values. Consider extracting a helper function to avoid duplication and centralize validation.

Proposed helper functions
def _scope_to_str(scope: MemoryScope) -> str:
    """Convert MemoryScope enum to PTX string representation."""
    mapping = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
    if scope not in mapping:
        raise ValueError(f"Unsupported scope for atomic operations: {scope}")
    return mapping[scope]

def _sem_to_str(sem: MemorySemantic) -> str:
    """Convert MemorySemantic enum to PTX string representation."""
    mapping = {
        MemorySemantic.RELAXED: "relaxed",
        MemorySemantic.ACQUIRE: "acquire",
        MemorySemantic.RELEASE: "release",
        MemorySemantic.ACQ_REL: "acq_rel"
    }
    if sem not in mapping:
        raise ValueError(f"Unsupported semantic for atomic operations: {sem}")
    return mapping[sem]

807-813: Type hint for dst_pe is inaccurate.

The type hint tir.PrimExpr | tir.IntImm | None = -1 has issues:

  1. tir.IntImm is a subclass of tir.PrimExpr, making it redundant
  2. The default -1 is a Python int, not covered by the type hint
  3. None is in the type hint but the default is -1, not None
Proposed fix
 def atom_add_remote(
     dst: PrimExpr,
     value: PrimExpr,
     scope: MemoryScope = MemoryScope.SYSTEM,
     sem: MemorySemantic = MemorySemantic.RELAXED,
-    dst_pe: tir.PrimExpr | tir.IntImm | None = -1,
+    dst_pe: tir.PrimExpr | int = -1,
 ):

Comment on lines 279 to 288
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--M", type=int, default=8192)
    parser.add_argument("--N", type=int, default=7168)
    parser.add_argument("--PE_num", type=int, default=8)
    parser.add_argument("--X", type=int, default=2)
    parser.add_argument("--Y", type=int, default=4)
    args = parser.parse_args()

    torch.multiprocessing.spawn(run_torus_alltoall, args=(args.PE_num, args), nprocs=args.PE_num)

⚠️ Potential issue | 🟡 Minor

Add validation that X * Y == PE_num.

The torus topology requires X * Y == PE_num, but this is not validated. If a user provides inconsistent values (e.g., --PE_num 8 --X 3 --Y 3), the kernel would produce incorrect results or fail unexpectedly.

🛡️ Proposed fix
     args = parser.parse_args()
+    
+    if args.X * args.Y != args.PE_num:
+        parser.error(f"X * Y must equal PE_num (got {args.X} * {args.Y} = {args.X * args.Y}, PE_num = {args.PE_num})")

     torch.multiprocessing.spawn(run_torus_alltoall, args=(args.PE_num, args), nprocs=args.PE_num)

Comment on lines +12 to +37
@T.prim_func
def main(
        src: T.Tensor((PE_num * M, N), "float16"),
        dst: T.Tensor((PE_num * M, N), "float16"),
        barrier: T.Tensor((PE_num), "int32"),
):
    # Currently not support tiled copy
    with T.Kernel(
            PE_num, T.ceildiv(M, block_M), T.ceildiv(N, block_N),
            threads=threads) as (bx, by, bz):
        rank = T.alloc_local([1], "int32")
        num_ranks = T.alloc_local([1], "int32")

        dst_rank = bx
        rank[0] = T.get_rank()
        num_ranks[0] = T.get_num_ranks()

        T.put_block(
            src=T.address_of(src[dst_rank * M + by * block_M, 0]),
            dst=T.address_of(dst[rank[0] * M + by * block_M, 0]),
            size=block_M * block_N,
            dst_pe=dst_rank,
        )
        T.fence_sys(sem=T.MemorySemantic.RELEASE)

return main

⚠️ Potential issue | 🟡 Minor

Unused barrier parameter in kernel signature.

The barrier parameter is declared in the kernel signature but never used within the kernel body. This could indicate either:

  1. A missing synchronization step that should use this barrier
  2. An unnecessary parameter that should be removed

Given this is an all-to-all operation, typically a barrier or synchronization mechanism is needed to ensure all ranks have completed their transfers before the kernel returns. Currently, only T.fence_sys is called which provides memory ordering but not inter-rank synchronization.

💡 Suggested fix: Either use the barrier or remove it

Option 1 - Add barrier synchronization:

             T.put_block(
                 src=T.address_of(src[dst_rank * M + by * block_M, 0]),
                 dst=T.address_of(dst[rank[0] * M + by * block_M, 0]),
                 size=block_M * block_N,
                 dst_pe=dst_rank,
             )
             T.fence_sys(sem=T.MemorySemantic.RELEASE)
+            T.barrier_blocks(barrier)

     return main

Option 2 - Remove unused parameter:

     @T.prim_func
     def main(
             src: T.Tensor((PE_num * M, N), "float16"),
             dst: T.Tensor((PE_num * M, N), "float16"),
-            barrier: T.Tensor((PE_num), "int32"),
     ):
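Option 1's device-side pattern (signal your own barrier slot, then spin until every slot is set) can be sketched host-side with threads as an analogy; this is illustrative only, not TileLang code:

```python
import threading

PE_num = 4
barrier_buf = [0] * PE_num  # stands in for the barrier tensor, one slot per PE

def rank_body(rank):
    # (device code would do put_block + a release fence here)
    barrier_buf[rank] = 1        # signal this rank's completion
    while not all(barrier_buf):  # spin until all PEs have signaled
        pass                     # on the GPU this would be an acquire-scoped wait

threads = [threading.Thread(target=rank_body, args=(r,)) for r in range(PE_num)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert barrier_buf == [1] * PE_num  # no thread escaped before all signaled
```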
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 16-16: Unused function argument: barrier

(ARG001)


[warning] 21-21: Unpacked variable bz is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


Comment on lines 73 to 75
    print(
        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N / (elapsed_time * 1e6):.3f} GB/s"
    )

⚠️ Potential issue | 🟡 Minor

Bandwidth calculation may be incorrect - missing dtype size factor.

The bandwidth formula 2 * PE_num * M * N / (elapsed_time * 1e6) calculates based on element count, but GB/s should be based on bytes. For float16 (2 bytes per element), the formula should multiply by the element size:

🔧 Proposed fix
     print(
-        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N / (elapsed_time * 1e6):.3f} GB/s"
+        f"Rank {local_rank} Kernel execution time: {elapsed_time:.3f} ms, Bandwidth: {2 * PE_num * M * N * 2 / (elapsed_time * 1e6):.3f} GB/s"
     )

Alternatively, for clarity:

total_bytes = 2 * PE_num * M * N * 2  # bidirectional * elements * sizeof(float16)
bandwidth_gbps = total_bytes / (elapsed_time * 1e6)
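Plugging the script's default shapes into the corrected formula gives a concrete sanity check (elapsed_time_ms here is a made-up example value, not a measurement):

```python
PE_num, M, N = 8, 8192, 7168
itemsize = 2            # sizeof(float16)
elapsed_time_ms = 5.0   # hypothetical kernel time

total_bytes = 2 * PE_num * M * N * itemsize  # bidirectional * elements * bytes
bandwidth_gbps = total_bytes / (elapsed_time_ms * 1e6)  # ms * 1e6 -> GB/s
print(f"{bandwidth_gbps:.3f} GB/s")  # 375.810 GB/s
```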

Comment on lines +422 to +431
  // Map integers to semantic literal strings for PTX atom instruction
  // Unified Mapping: 2: relaxed, 3: acquire, 4: release, 5: acq_rel
  const char *sem_str[] = {"weak",    "volatile", "relaxed",
                           "acquire", "release",  "acq_rel"};
  const char *scope_str[] = {
      "cta", "cluster", "gpu",
      "sys"}; // Unified: 2: gpu, 3: system (mapped below)

  // Build function name: tl::ptx_atom_add_<sem>_<scope>
  ss << "tl::ptx_atom_add_" << sem_str[sem] << "_" << scope_str[scope];

⚠️ Potential issue | 🔴 Critical

Missing bounds checks for sem and scope indices.

Unlike StOpNode::Lower (line 283-284) and LdOpNode::Lower (line 357-358), this method does not validate sem and scope before using them as array indices. If sem >= 6 or scope >= 4, this causes undefined behavior (buffer overread).

🐛 Proposed fix to add bounds checks
   const char *sem_str[] = {"weak",    "volatile", "relaxed",
                            "acquire", "release",  "acq_rel"};
   const char *scope_str[] = {
       "cta", "cluster", "gpu",
       "sys"}; // Unified: 2: gpu, 3: system (mapped below)
 
+  ICHECK_LT(sem, 6) << "Invalid semantic value: " << sem;
+  ICHECK_LT(scope, 4) << "Invalid scope value: " << scope;
+
   // Build function name: tl::ptx_atom_add_<sem>_<scope>
   ss << "tl::ptx_atom_add_" << sem_str[sem] << "_" << scope_str[scope];
🤖 Prompt for AI Agents
In `@src/op/remote_copy.cc` around lines 422 - 431, The code constructs a function
name using sem_str[sem] and scope_str[scope] without validating indices, which
can lead to buffer overreads; add bounds checks for sem (<6) and scope (<4)
before indexing (similar to StOpNode::Lower and LdOpNode::Lower), and handle
out-of-range values by selecting a safe default string or emitting an error/log
and returning early so the use in building "tl::ptx_atom_add_<sem>_<scope>"
cannot index past sem_str or scope_str.

Comment on lines +327 to +328
int scope; ///< Memory scope (0: GPU, 1: SYS)
int sem; ///< Memory semantic (0: relaxed, 1: acquire, 2: release, 3: acq_rel)

⚠️ Potential issue | 🟡 Minor

Documentation does not match implementation.

The comments describe incorrect mappings:

  • scope: documented as (0: GPU, 1: SYS), but implementation uses {0: cta, 1: cluster, 2: gpu, 3: sys}.
  • sem: documented as (0: relaxed, 1: acquire, 2: release, 3: acq_rel), but implementation uses {0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5: acq_rel}.

This inconsistency will mislead developers using this API.

📝 Suggested documentation fix
-  int scope;       ///< Memory scope (0: GPU, 1: SYS)
-  int sem; ///< Memory semantic (0: relaxed, 1: acquire, 2: release, 3: acq_rel)
+  int scope;  ///< Memory scope (0: CTA, 1: CLUSTER, 2: GPU, 3: SYS)
+  int sem;    ///< Memory semantic (0: WEAK, 1: VOLATILE, 2: RELAXED, 3: ACQUIRE, 4: RELEASE, 5: ACQ_REL)
🤖 Prompt for AI Agents
In `@src/op/remote_copy.h` around lines 327 - 328, Update the field comments for
the "scope" and "sem" members to match the actual enum/mapping used in the
implementation: describe scope as "{0: cta, 1: cluster, 2: gpu, 3: sys}" and
describe sem as "{0: weak, 1: volatile, 2: relaxed, 3: acquire, 4: release, 5:
acq_rel}" so the documentation for the int members scope and sem in
remote_copy.h matches the implemented semantics.

Comment on lines 135 to +147
WaitOp::WaitOp(Array<PrimExpr> args, BufferMap vmap) {
ObjectPtr<WaitOpNode> node = make_object<WaitOpNode>();
node->relation = args[0].as<IntImmNode>()->value;
ICHECK_GE(args.size(), 4);
const auto *relation_node = args[0].as<IntImmNode>();
ICHECK(relation_node) << "Wait relation must be an integer";
node->relation = relation_node->value;
node->addr = args[1];
node->expected = args[2];
node->peer = args[3];
// scope parameter is optional, default to SYSTEM (3) for safety
node->scope = (args.size() > 4) ? args[4].as<IntImmNode>()->value : 3;
// semantic parameter is optional, default to ACQUIRE (3) for safety
node->semantic = (args.size() > 5) ? args[5].as<IntImmNode>()->value : 3;

⚠️ Potential issue | 🟠 Major

Guard optional scope/semantic args before deref.

args[4] and args[5] are assumed IntImmNode and dereferenced directly, which will crash if a non-constant is passed. Add ICHECK (or accept non-const and validate later) to fail fast with a clear error.

✅ Suggested fix
-  node->scope = (args.size() > 4) ? args[4].as<IntImmNode>()->value : 3;
-  // semantic parameter is optional, default to ACQUIRE (3) for safety
-  node->semantic = (args.size() > 5) ? args[5].as<IntImmNode>()->value : 3;
+  if (args.size() > 4) {
+    const auto *scope_node = args[4].as<IntImmNode>();
+    ICHECK(scope_node) << "Wait scope must be an integer";
+    node->scope = scope_node->value;
+  } else {
+    node->scope = 3;
+  }
+  // semantic parameter is optional, default to ACQUIRE (3) for safety
+  if (args.size() > 5) {
+    const auto *semantic_node = args[5].as<IntImmNode>();
+    ICHECK(semantic_node) << "Wait semantic must be an integer";
+    node->semantic = semantic_node->value;
+  } else {
+    node->semantic = 3;
+  }

Comment on lines +733 to +747
def atom_add(barrier: PrimExpr,
value: PrimExpr,
scope: MemoryScope = MemoryScope.GPU,
sem: MemorySemantic = MemorySemantic.RELAXED):
"""Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
"""
assert scope in ["gpu", "sys"], "Scope must be one of 'gpu', or 'sys'."
assert sem in ["relaxed", "acquire", "release", "acq_rel"
], "Semantic must be one of 'relaxed', 'acquire', 'release', or 'acq_rel'."
return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value, sem,
scope)
scope_str = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}[scope]
sem_str = {
MemorySemantic.RELAXED: "relaxed",
MemorySemantic.ACQUIRE: "acquire",
MemorySemantic.RELEASE: "release",
MemorySemantic.ACQ_REL: "acq_rel"
}[sem]
return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value,
sem_str, scope_str)

⚠️ Potential issue | 🟡 Minor

Missing error handling for unsupported scope/semantic values.

The dictionary lookups at lines 739-745 will raise KeyError if called with unsupported values (e.g., MemoryScope.CTA, MemoryScope.CLUSTER, or MemorySemantic.WEAK, MemorySemantic.VOLATILE, MemorySemantic.SC). Consider adding explicit validation with a helpful error message.

Proposed fix with validation
 def atom_add(barrier: PrimExpr,
              value: PrimExpr,
              scope: MemoryScope = MemoryScope.GPU,
              sem: MemorySemantic = MemorySemantic.RELAXED):
     """Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
     """
-    scope_str = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}[scope]
-    sem_str = {
+    scope_map = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
+    sem_map = {
         MemorySemantic.RELAXED: "relaxed",
         MemorySemantic.ACQUIRE: "acquire",
         MemorySemantic.RELEASE: "release",
         MemorySemantic.ACQ_REL: "acq_rel"
-    }[sem]
+    }
+    if scope not in scope_map:
+        raise ValueError(f"atom_add only supports GPU and SYSTEM scopes, got {scope}")
+    if sem not in sem_map:
+        raise ValueError(f"atom_add only supports RELAXED, ACQUIRE, RELEASE, ACQ_REL semantics, got {sem}")
+    scope_str = scope_map[scope]
+    sem_str = sem_map[sem]
     return tir.call_intrin("uint32", tir.op.Op.get("tl.atom_add"), address_of(barrier), value,
                            sem_str, scope_str)
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 733 - 747, The atom_add function
currently uses direct dict lookups for scope_str and sem_str which will raise
KeyError for unsupported MemoryScope or MemorySemantic values; update atom_add
to explicitly validate inputs before using them (check the provided scope
against allowed values {MemoryScope.GPU, MemoryScope.SYSTEM} and sem against
{MemorySemantic.RELAXED, MemorySemantic.ACQUIRE, MemorySemantic.RELEASE,
MemorySemantic.ACQ_REL}), and if invalid raise a clear ValueError mentioning the
parameter name, the unsupported value, and the list of supported options; then
proceed to build scope_str/sem_str and call tir.call_intrin(address_of(barrier),
value, sem_str, scope_str) as before.
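The validated mapping can be sketched in plain Python; the enum names mirror the diff above, while `resolve_atom_add_strings` is a hypothetical helper standing in for the lookup inside `atom_add` (the `tir.call_intrin` call itself is omitted):

```python
from enum import Enum

class MemoryScope(Enum):
    CTA = 0
    CLUSTER = 1
    GPU = 2
    SYSTEM = 3

class MemorySemantic(Enum):
    WEAK = 0
    VOLATILE = 1
    RELAXED = 2
    ACQUIRE = 3
    RELEASE = 4
    ACQ_REL = 5

# Only the combinations atom_add actually supports are present.
_SCOPE_STR = {MemoryScope.GPU: "gpu", MemoryScope.SYSTEM: "sys"}
_SEM_STR = {
    MemorySemantic.RELAXED: "relaxed",
    MemorySemantic.ACQUIRE: "acquire",
    MemorySemantic.RELEASE: "release",
    MemorySemantic.ACQ_REL: "acq_rel",
}

def resolve_atom_add_strings(scope, sem):
    """Map enums to PTX strings, rejecting unsupported values up front."""
    if scope not in _SCOPE_STR:
        raise ValueError(
            f"atom_add only supports GPU and SYSTEM scopes, got {scope}")
    if sem not in _SEM_STR:
        raise ValueError(
            f"atom_add only supports RELAXED/ACQUIRE/RELEASE/ACQ_REL, got {sem}")
    return _SEM_STR[sem], _SCOPE_STR[scope]
```

Failing with a `ValueError` here gives the caller a clear message instead of a bare `KeyError` from the dict lookup.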

Comment on lines 737 to 738
"""Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
"""

⚠️ Potential issue | 🟡 Minor

Incorrect docstring for atom_add.

The docstring describes "ptx async copy barrier using cp.async.mbarrier.arrive.noinc" which is incorrect—this function performs an atomic add operation. This appears to be copy-pasted from cp_async_barrier_noinc.

Proposed fix
 def atom_add(barrier: PrimExpr,
              value: PrimExpr,
              scope: MemoryScope = MemoryScope.GPU,
              sem: MemorySemantic = MemorySemantic.RELAXED):
-    """Perform a ptx async copy barrier using cp.async.mbarrier.arrive.noinc.
+    """Perform an atomic add operation on the given address.
+
+    Args:
+        barrier: The address to perform atomic add on.
+        value: The value to add.
+        scope: The memory scope (GPU or SYSTEM).
+        sem: The memory semantic (RELAXED, ACQUIRE, RELEASE, or ACQ_REL).
+
+    Returns:
+        tir.Call: The old value before the atomic add.
     """
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 737-738: replace the
copy-pasted cp.async barrier docstring on atom_add with a concise description
stating that it performs an atomic add (adding a value to the target address
and returning the previous value, per the implementation), document the
supported data types and memory scopes/semantics (whether the operation is
relaxed or ordered, and which PTX atomic instruction it maps to), and ensure
the docstring matches the function's signature and behavior.

Comment on lines +839 to +854
is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)

if is_remote:
# Compute remote address: remote_base_ptr(dst_pe) + (address_of(dst) - remote_base_ptr(get_rank()))
local_rank = tir.Call("int64", tir.op.Op.get("tl.get_rank"), [])
local_base_ptr = tir.Call("handle", tir.op.Op.get("tl.get_remote_base_ptr"), [local_rank])
offset_to_base = tir.Sub(
tir.Call("handle", tir.op.Op.get("tl.get_uintptr_t"), [address_of(dst)]),
local_base_ptr)
remote_ptr = tir.Add(
tir.Call("handle", tir.op.Op.get("tl.get_remote_base_ptr"), [dst_pe]), offset_to_base)
# Call the PTX intrinsic directly with remote address
return tir.call_extern("uint32", func_name, remote_ptr, value)
else:
# Local atomic add
return tir.call_extern("uint32", func_name, address_of(dst), value)

⚠️ Potential issue | 🔴 Critical

Bug: is_remote check doesn't handle Python int default value.

The check at line 839 only handles tir.IntImm:

is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)

However, the default value -1 is a Python int, not a tir.IntImm. When called with the default dst_pe=-1, isinstance(dst_pe, tir.IntImm) is False, making is_remote = True, which incorrectly triggers the remote path for local operations.

Proposed fix
     # If dst_pe is specified and not -1, compute remote address
-    is_remote = not (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)
+    is_local = (isinstance(dst_pe, int) and dst_pe == -1) or \
+               (isinstance(dst_pe, tir.IntImm) and dst_pe.value == -1)
+    is_remote = not is_local
 
     if is_remote:
🤖 Prompt for AI Agents
In `@tilelang/language/builtin.py` around lines 839 - 854, The is_remote check in
the function incorrectly treats a plain Python int dst_pe == -1 as remote
because it only checks isinstance(dst_pe, tir.IntImm); update the logic in the
is_remote computation (the variable named is_remote) to also consider a plain
int -1 as local—e.g., treat dst_pe == -1 the same as tir.IntImm with value == -1
or normalize dst_pe via getattr(dst_pe, "value", dst_pe) before comparison—so
the remote branch (the code that builds local_rank, local_base_ptr,
offset_to_base, remote_ptr and calls tir.call_extern with remote_ptr) is only
taken when dst_pe truly indicates a remote PE.
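The normalized check can be sketched in plain Python; `IntImm` here is a minimal stand-in for `tir.IntImm` (only its `.value` attribute matters), and `is_local_pe` is a hypothetical helper mirroring the proposed fix:

```python
class IntImm:
    """Stand-in for tir.IntImm: a wrapped integer constant with a .value field."""
    def __init__(self, value):
        self.value = value

def is_local_pe(dst_pe):
    """Treat both the plain Python default -1 and a wrapped IntImm(-1) as local.

    getattr(dst_pe, "value", dst_pe) unwraps an IntImm and leaves a plain
    int unchanged, so the comparison covers both call styles.
    """
    return getattr(dst_pe, "value", dst_pe) == -1
```

With this normalization, calling the op with the default `dst_pe=-1` no longer takes the remote path.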

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 325-327: Remove the redundant f-prefix from f-strings that
contain no placeholders: change print(f"Benchmark Results:") to
print("Benchmark Results:"), and do the same for any other literal-only prints
in the block. Leave the prints that format variables (elapsed_time_ms,
bandwidth_gbps) as f-strings; update the print statements in the block
containing "Benchmark Results:", "  Average Latency:", and "  Effective
Bandwidth:" accordingly.
- Around line 233-236: Validate and guard block sizing before launching the
kernel: ensure block_M is > 0 and that M is evenly divisible by block_M (or
implement tail handling) so the kernel’s copy of block_M * N elements cannot go
out-of-bounds; specifically, check/adjust the variables M, block_M, block_N and
threads used in the setup (where block_M = M // 8, block_N = N, num_blocks_M = M
// block_M) to either compute num_blocks_M = ceil_div(M, block_M) and handle the
final partial block inside the kernel or error out with a clear message when
block_M == 0 or M % block_M != 0.
- Around line 27-28: The function torus_alltoall_xy currently declares a
parameter block_N which is never used; remove block_N from the function
signature and from all call-sites that pass it (ensure callers now pass only
PE_num, X, Y, M, N, block_M, threads), and update any related documentation or
tests that reference the old signature; search for usages of torus_alltoall_xy
to update invocations and adjust any higher-level wrappers or type hints
referencing block_N to keep the API consistent.

Comment on lines 233 to 236
M, N = args.M, args.N
block_M, block_N = M // 8, N
threads = 128
num_blocks_M = M // block_M

⚠️ Potential issue | 🟠 Major

Guard against invalid block sizing (OOB/div-by-zero).
The kernel always copies block_M * N elements, so when M % block_M != 0 or block_M == 0, you can hit OOB or divide-by-zero. Add validation or tail handling.

✅ Suggested validation
-    block_M, block_N = M // 8, N
+    block_M = max(1, M // 8)
+    if M % block_M != 0:
+        raise ValueError(
+            f"M must be divisible by block_M (M // 8). Got M={M}, block_M={block_M}"
+        )
     threads = 128
     num_blocks_M = M // block_M
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 233
- 236, Validate and guard block sizing before launching the kernel: ensure
block_M is > 0 and that M is evenly divisible by block_M (or implement tail
handling) so the kernel’s copy of block_M * N elements cannot go out-of-bounds;
specifically, check/adjust the variables M, block_M, block_N and threads used in
the setup (where block_M = M // 8, block_N = N, num_blocks_M = M // block_M) to
either compute num_blocks_M = ceil_div(M, block_M) and handle the final partial
block inside the kernel or error out with a clear message when block_M == 0 or M
% block_M != 0.
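A host-side sketch of this validation, assuming `plan_blocks` as a hypothetical helper mirroring the example's `block_M = M // 8` computation:

```python
def plan_blocks(M, split=8):
    """Derive (block_M, num_blocks_M) from M and fail fast on bad shapes.

    Mirrors block_M = M // 8 from the example; raises instead of letting
    the kernel copy block_M * N elements past the end of the buffer when
    M is not evenly covered, or divide by zero when block_M == 0.
    """
    block_M = M // split
    if block_M == 0:
        raise ValueError(f"M={M} is too small for split={split}")
    if M % block_M != 0:
        raise ValueError(f"M={M} is not divisible by block_M={block_M}")
    return block_M, M // block_M
```

The alternative is tail handling inside the kernel (a ceil-div block count plus a bounds check on the final partial block), but erroring out keeps the example simple.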

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 303-316: The benchmark loops reuse tensors so local_finish,
global_finish, and signal_transfer retain values across runs causing kernel (the
all-to-all transfer loop) to skip work; before each warmup and each timed
iteration (i.e., before calling kernel in both the 5 warmup iterations and
inside the num_iters loop) reset/zero-out those tensors (local_finish,
global_finish, signal_transfer) on the appropriate device so the kernel’s
while-loop condition behaves correctly; locate references to kernel(...,
buffer_transfer, signal_transfer, local_finish, global_finish, barrier) and
insert resets just prior to those calls.
🧹 Nitpick comments (2)
examples/distributed/intranode/example_alltoall_route2x4.py (2)

141-141: Remove debug print statement.

This T.print appears to be a leftover debug statement that will generate output during kernel execution, which may not be desired in production.

🧹 Proposed fix
-                T.print(bz, msg="block_idx")

199-212: Termination broadcast has O(PE_num³ × num_blocks_M) complexity.

The quad-nested loop broadcasts termination signals to all signal slots across all PEs. For larger configurations (e.g., 16 PEs with many blocks), this single-threaded broadcast from one block could become a bottleneck. Consider whether a more efficient termination mechanism (e.g., hierarchical signaling or collective barrier) would be beneficial for scalability.

Comment on lines 303 to 316
if args.benchmark:
# Warmup
for _ in range(5):
kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
torch.cuda.synchronize()
dist.barrier(group_size)

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

num_iters = 10
start_event.record()
for _ in range(num_iters):
kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)

⚠️ Potential issue | 🔴 Critical

Critical: Signal and finish tensors are not reset between benchmark iterations.

After the first kernel execution, local_finish, global_finish, and signal_transfer will contain stale values from the previous run. The kernel's while-loop condition (global_finish[0] < PE_num) will be immediately false on subsequent iterations, causing the kernel to skip the transfer phase entirely.

Reset these tensors before each iteration or at least before the warmup/timed loops.

🐛 Proposed fix
     if args.benchmark:
         # Warmup
         for _ in range(5):
+            signal_transfer.fill_(PE_num + 1)
+            local_finish.fill_(0)
+            global_finish.fill_(0)
             kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
         torch.cuda.synchronize()
         dist.barrier(group_size)

         start_event = torch.cuda.Event(enable_timing=True)
         end_event = torch.cuda.Event(enable_timing=True)

         num_iters = 10
         start_event.record()
         for _ in range(num_iters):
+            signal_transfer.fill_(PE_num + 1)
+            local_finish.fill_(0)
+            global_finish.fill_(0)
             kernel(src, dst, buffer_transfer, signal_transfer, local_finish, global_finish, barrier)
         end_event.record()
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 303
- 316, The benchmark loops reuse tensors so local_finish, global_finish, and
signal_transfer retain values across runs causing kernel (the all-to-all
transfer loop) to skip work; before each warmup and each timed iteration (i.e.,
before calling kernel in both the 5 warmup iterations and inside the num_iters
loop) reset/zero-out those tensors (local_finish, global_finish,
signal_transfer) on the appropriate device so the kernel’s while-loop condition
behaves correctly; locate references to kernel(..., buffer_transfer,
signal_transfer, local_finish, global_finish, barrier) and insert resets just
prior to those calls.
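The reset-before-each-launch pattern, sketched without torch; `kernel` and `make_state` are hypothetical stand-ins for the TileLang kernel and the tensor resets (the `fill_` calls) described above:

```python
import time

def run_benchmark(kernel, make_state, num_warmup=5, num_iters=10):
    """Benchmark a kernel that mutates its signal/finish state in place.

    The real kernel mutates signal_transfer/local_finish/global_finish;
    without a fresh reset, each subsequent launch re-enters with stale
    values and skips the transfer phase, so the measured time is bogus.
    Here make_state() plays the role of the fill_ resets.
    """
    for _ in range(num_warmup):
        kernel(make_state())  # fresh state per warmup launch
    start = time.perf_counter()
    for _ in range(num_iters):
        kernel(make_state())  # fresh state per timed launch
    return (time.perf_counter() - start) * 1e3 / num_iters  # ms per iteration
```

On GPU the reset must also land on the right device and be ordered before the launch, which `fill_` on the same stream already guarantees.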

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 236-243: The host-side block computation can produce num_blocks ==
0 and trigger a ZeroDivisionError when computing tile_M; modify the calculation
of num_blocks (used before computing tile_M) to guard against zero by ensuring
num_blocks = max(1, min(num_tiles, 148 // (PE_num * PE_num))) (or explicitly
check if num_blocks == 0 and set it to 1) so the subsequent tile_M = M //
(num_blocks * num_warps) cannot divide by zero; update references to num_blocks,
num_tiles, num_warps and tile_M accordingly.
- Around line 27-35: In torus_alltoall_xy, remove the hardcoded num_SM=148 and
either query the device SM count (e.g., cuda.get_device().MULTIPROCESSOR_COUNT
or similar) or add num_SM as a function parameter; remove or use the unused
tile_N parameter (Ruff ARG001) — if block_N was intended, rename consistently;
compute num_blocks = max(1, min(num_SM // (PE_num * PE_num), num_tiles)) to
avoid ZeroDivisionError when PE_num is large, then compute block_M,
tiles_per_block and keep the existing tiles_per_block <= num_warps assertion
using num_warps = threads // 32. Ensure all references use the updated names
(num_SM, num_blocks, block_M, tiles_per_block, num_warps).
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall.py (1)

40-56: group_size is misleading — it's a ProcessGroup, not a size.

init_dist returns (rank, world_size, ProcessGroup). The name group_size suggests a numeric value, but it's actually a dist.ProcessGroup object passed to dist.barrier() and dist.all_to_all_single(). Consider renaming to group for clarity.

Also, after line 48, local_rank is actually the global rank (from dist.get_rank()). In a single-node setup these are identical, but the naming could confuse readers (and would silently break if extended to multi-node).

♻️ Suggested rename
-    local_rank, num_ranks, group_size = init_dist(local_rank, num_ranks)
+    rank, world_size, group = init_dist(local_rank, num_ranks)

And update all downstream references accordingly.

Comment on lines 27 to 35
def torus_alltoall_xy(PE_num, X, Y, M, N, tile_M, tile_N, threads):
num_SM = 148
num_tiles = M // tile_M
num_blocks = min(num_SM // (PE_num * PE_num), num_tiles)
num_warps = threads // 32
block_M = M // num_blocks
tiles_per_block = block_M // tile_M

assert tiles_per_block <= num_warps, "Each warp should handle the signal and transfer of one tile"

⚠️ Potential issue | 🟡 Minor

num_SM is hardcoded to 148 (H100-specific) and tile_N is unused; large PE_num causes ZeroDivisionError.

  1. num_SM = 148 is specific to H100/H200. On other architectures (e.g., A100 with 108 SMs), the occupancy calculation would be wrong.
  2. tile_N is never used inside the function (Ruff ARG001) — likely a leftover from the block_N rename.
  3. When PE_num >= 13, num_SM // (PE_num * PE_num) evaluates to 0, causing ZeroDivisionError at line 32 (block_M = M // num_blocks) and line 33.
🛡️ Suggested guard
     num_SM = 148
     num_tiles = M // tile_M
     num_blocks = min(num_SM // (PE_num * PE_num), num_tiles)
+    assert num_blocks > 0, f"PE_num={PE_num} is too large: num_SM // (PE_num^2) = 0"
     num_warps = threads // 32
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 27-27: Unused function argument: tile_N

(ARG001)

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 27
- 35, In torus_alltoall_xy, remove the hardcoded num_SM=148 and either query the
device SM count (e.g., cuda.get_device().MULTIPROCESSOR_COUNT or similar) or add
num_SM as a function parameter; remove or use the unused tile_N parameter (Ruff
ARG001) — if block_N was intended, rename consistently; compute num_blocks =
max(1, min(num_SM // (PE_num * PE_num), num_tiles)) to avoid ZeroDivisionError
when PE_num is large, then compute block_M, tiles_per_block and keep the
existing tiles_per_block <= num_warps assertion using num_warps = threads // 32.
Ensure all references use the updated names (num_SM, num_blocks, block_M,
tiles_per_block, num_warps).
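A guarded version of the occupancy computation, as a plain-Python sketch; querying `num_SM` from the device (e.g. `torch.cuda.get_device_properties(0).multi_processor_count`) is left to the caller, and `compute_num_blocks` is a hypothetical helper:

```python
def compute_num_blocks(num_SM, PE_num, num_tiles):
    """Never return 0 blocks, even when PE_num**2 exceeds the SM count.

    num_SM should come from the device rather than a hardcoded 148
    (H100-specific; an A100 has 108 SMs). Clamping with max(1, ...)
    avoids the downstream ZeroDivisionError in block_M = M // num_blocks.
    """
    return max(1, min(num_SM // (PE_num * PE_num), num_tiles))
```

The same helper can replace the duplicated host-side computation so kernel and host stay consistent.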

Comment on lines 236 to 243
tile_M, tile_N = M // 16, N
threads = 256

num_tiles = M // tile_M
num_blocks = min(num_tiles, 148 // (PE_num * PE_num))
num_warps = threads // 32
# Modify tile_M to use all warps
tile_M = M // (num_blocks * num_warps)

⚠️ Potential issue | 🟡 Minor

Same ZeroDivisionError risk in host-side block computation.

The host duplicates the kernel's num_blocks calculation (line 240). For PE_num >= 13, num_blocks becomes 0, causing ZeroDivisionError at line 243 (tile_M = M // (num_blocks * num_warps)). Add a guard here as well.

🛡️ Suggested guard
     num_tiles = M // tile_M
     num_blocks = min(num_tiles, 148 // (PE_num * PE_num))
+    if num_blocks == 0:
+        raise ValueError(f"PE_num={PE_num} is too large for num_SM=148: cannot allocate any blocks")
     num_warps = threads // 32
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 236
- 243, The host-side block computation can produce num_blocks == 0 and trigger a
ZeroDivisionError when computing tile_M; modify the calculation of num_blocks
(used before computing tile_M) to guard against zero by ensuring num_blocks =
max(1, min(num_tiles, 148 // (PE_num * PE_num))) (or explicitly check if
num_blocks == 0 and set it to 1) so the subsequent tile_M = M // (num_blocks *
num_warps) cannot divide by zero; update references to num_blocks, num_tiles,
num_warps and tile_M accordingly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 20-26: The decorator call to tilelang.jit currently passes a
hardcoded developer-specific debug_root_path
(debug_root_path="/home/zhengju.tang/...") which must be removed or made
portable; update the tilelang.jit invocation (the pass_configs and debug
options) to either omit the debug_root_path argument entirely or replace it with
a relative path or environment-driven value (e.g., read from an env var or use
"./debug") so the example contains no absolute user home directories.
- Around line 114-157: chunk_size can become negative when tile_M < PE_num;
before calling T.put_warp (both branches) compute and guard chunk_size so it is
non-negative—for example clamp available = T.max(0, tile_M - chunk_start) and
set chunk_size = T.min(chunk_M, available) (or skip the
T.put_warp/T.atom_add/T.sync_warp sequence when chunk_size <= 0). Update the
calculations around chunk_M, chunk_start, chunk_size and the put_warp calls to
use this non-negative-clamped value (references: chunk_M, chunk_start,
chunk_size, bx, tile_M, PE_num, and the T.put_warp/T.atom_add calls).

Comment on lines +20 to +26
@tilelang.jit(
pass_configs={
tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
},
debug_root_path="/home/zhengju.tang/tilescale/examples/distributed/debug/"
)

⚠️ Potential issue | 🟡 Minor

Remove the developer-specific hardcoded debug_root_path.

Line 25 contains an absolute path pointing to a developer's home directory (/home/zhengju.tang/...). This won't work on any other machine, leaks a username, and shouldn't ship in a public example. Either remove the argument entirely or use a relative path.

🔧 Proposed fix
 `@tilelang.jit`(
     pass_configs={
         tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
         tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
     },
-    debug_root_path="/home/zhengju.tang/tilescale/examples/distributed/debug/"
 )
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 20
- 26, The decorator call to tilelang.jit currently passes a hardcoded
developer-specific debug_root_path (debug_root_path="/home/zhengju.tang/...")
which must be removed or made portable; update the tilelang.jit invocation (the
pass_configs and debug options) to either omit the debug_root_path argument
entirely or replace it with a relative path or environment-driven value (e.g.,
read from an env var or use "./debug") so the example contains no absolute user
home directories.

Comment on lines 114 to 157
# Phase 1: Fully use all blocks to initially send from src to the target neighbor
# Split the tile_M to each block

chunk_M = T.ceildiv(tile_M, PE_num)
chunk_start = bx * chunk_M
chunk_size = T.min(chunk_M, tile_M - chunk_start)

# if src_rank[0] == rank[0]:
if dst_rank[0] != rank[0]:
T.put_warp(
T.address_of(src[dst_rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
T.address_of(buffer_transfer[rank[0], dst_rank[0], bz * block_M + warp_idx * tile_M + chunk_start, 0]),
chunk_size * N,
next_rank[0],
)
if tx % 32 == 0:
T.atom_add_remote(
signal_transfer[rank[0], dst_rank[0], bz, warp_idx],
1,
scope=T.MemoryScope.SYSTEM,
sem=T.MemorySemantic.RELEASE,
dst_pe=next_rank[0],
)
T.sync_warp()
# T.fence_sys(sem=T.MemorySemantic.RELEASE)
else:
T.put_warp(
T.address_of(src[dst_rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
T.address_of(dst[rank[0] * M + bz * block_M + warp_idx * tile_M + chunk_start, 0]),
chunk_size * N,
-1,
)
if tx % 32 == 0:
old_local[0] = T.atom_add(
local_finish[0],
1,
scope=T.MemoryScope.GPU,
sem=T.MemorySemantic.RELEASE,
)
T.sync_warp()
# T.fence_cta(sem=T.MemorySemantic.RELEASE)

# T.fence_sys(sem=T.MemorySemantic.RELEASE)
T.sync_threads()

⚠️ Potential issue | 🟡 Minor

Phase 1: chunk_size can become negative for extreme parameter combinations.

When tile_M < PE_num, chunk_start = bx * chunk_M can exceed tile_M, making tile_M - chunk_start negative. T.min(chunk_M, negative_value) would then produce a negative chunk_size, which is passed as the size to put_warp. With the default parameters (M=8192, PE_num=8) this does not trigger, but a guard would prevent misuse.

🛡️ Suggested guard (inside the kernel, before the put_warp calls)
             chunk_M = T.ceildiv(tile_M, PE_num)
             chunk_start = bx * chunk_M
             chunk_size = T.min(chunk_M, tile_M - chunk_start)
 
-            if dst_rank[0] != rank[0]:
+            if chunk_start < tile_M and dst_rank[0] != rank[0]:

(Same guard for the else branch.)
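To make the failure mode concrete, here is a plain-Python sketch of the kernel's chunk arithmetic (illustrative only; `T.ceildiv` and `T.min` are modeled with `math.ceil` and `min`):

```python
import math

def chunk_bounds(tile_M, PE_num, bx):
    # Mirrors the kernel's arithmetic: T.ceildiv -> math.ceil, T.min -> min
    chunk_M = math.ceil(tile_M / PE_num)
    chunk_start = bx * chunk_M
    chunk_size = min(chunk_M, tile_M - chunk_start)  # can go negative
    return chunk_start, chunk_size

def chunk_size_clamped(tile_M, PE_num, bx):
    # Suggested fix: clamp the remaining extent to zero before taking the min
    chunk_M = math.ceil(tile_M / PE_num)
    chunk_start = bx * chunk_M
    return min(chunk_M, max(0, tile_M - chunk_start))

# Extreme case tile_M < PE_num: block bx=6 starts past the end of the tile
print(chunk_bounds(tile_M=4, PE_num=8, bx=6))        # (6, -2) -> negative size
print(chunk_size_clamped(tile_M=4, PE_num=8, bx=6))  # 0 -> transfer skipped

# Default parameters never trigger the issue
print(chunk_bounds(tile_M=8192, PE_num=8, bx=0))     # (0, 1024)
```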

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 114
- 157, chunk_size can become negative when tile_M < PE_num; before calling
T.put_warp (both branches) compute and guard chunk_size so it is
non-negative—for example clamp available = T.max(0, tile_M - chunk_start) and
set chunk_size = T.min(chunk_M, available) (or skip the
T.put_warp/T.atom_add/T.sync_warp sequence when chunk_size <= 0). Update the
calculations around chunk_M, chunk_start, chunk_size and the put_warp calls to
use this non-negative-clamped value (references: chunk_M, chunk_start,
chunk_size, bx, tile_M, PE_num, and the T.put_warp/T.atom_add calls).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 165-166: The condition uses a boolean comparison "flag[0] ==
False"; change it to use a truthiness check by replacing "flag[0] == False" with
"not flag[0]" in the if statement that checks signal_transfer, dst_rank and flag
(the line referencing signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num
and flag[0] == False) so it conforms to PEP8 and handles non-boolean flag values
correctly.
- Line 245: The host-side assignment tile_M = M // (num_blocks * num_warps) is
dead code (unused) because the kernel computes its own tile_M; remove this
unused variable assignment (or instead pass it into the kernel if you intended
to use it) to fix the Ruff F841 warning—look for the symbol tile_M and the
related variables M, num_blocks, num_warps in example_alltoall_route2x4.py and
either delete the assignment or wire tile_M into the kernel invocation so it is
actually consumed.
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall_route2x4.py (1)

200-223: Termination detection: old_global[0] only reflects the last loop iteration's return value.

The loop at lines 201–208 calls atom_add_remote to each PE, but old_global[0] is overwritten each iteration and only retains the return from dst_pe = PE_num - 1. The subsequent check at line 209 (old_global[0] + 1 == PE_num) thus only gates on whether PE PE_num-1 has reached PE_num in its global_finish. This works as a distributed termination pattern (exactly one thread across all PEs will observe this transition), but the intent is subtle. A brief comment explaining why checking only the last PE suffices would help future readers.
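The pattern can be modeled in plain Python, using a hypothetical stand-in for `atom_add_remote` that, like a fetch-and-add, is assumed to return the destination counter's previous value:

```python
PE_num = 4
# Each PE's global_finish counter; suppose every PE has already received
# PE_num - 1 of the PE_num expected increments.
global_finish = [PE_num - 1] * PE_num

def atom_add_remote(dst_pe, val):
    # Hypothetical stand-in: fetch-and-add returning the previous value
    old = global_finish[dst_pe]
    global_finish[dst_pe] += val
    return old

old_global = [0]
for dst_pe in range(PE_num):
    old_global[0] = atom_add_remote(dst_pe, 1)  # overwritten every iteration

# Only the return from dst_pe == PE_num - 1 survives the loop, so this
# check effectively gates on the last PE's counter alone.
print(old_global[0] + 1 == PE_num)  # True
```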

Comment on lines +165 to +166
if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0] == False:
flag[0] = True
⚠️ Potential issue | 🟡 Minor

Use not flag[0] instead of == False.

Comparing with == False is a PEP 8 / linting violation (E712) and can behave unexpectedly with non-boolean types.

-            if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0] == False:
+            if signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and not flag[0]:
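A quick illustration of why the two checks differ for non-boolean values:

```python
flag = 0.0             # e.g. a float-typed flag element
print(flag == False)   # True: 0.0 compares equal to False
flag = None
print(flag == False)   # False: the == False check silently misses None
print(not flag)        # True: truthiness covers every falsy value
```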
🧰 Tools
🪛 Ruff (0.14.14)

[error] 165-165: Avoid equality comparisons to False; use not flag[0]: for false checks

Replace with not flag[0]

(E712)

🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 165
- 166, The condition uses a boolean comparison "flag[0] == False"; change it to
use a truthiness check by replacing "flag[0] == False" with "not flag[0]" in the
if statement that checks signal_transfer, dst_rank and flag (the line
referencing signal_transfer[bx, dst_rank[0], bz, warp_idx] == PE_num and flag[0]
== False) so it conforms to PEP8 and handles non-boolean flag values correctly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py`:
- Around line 276-278: The shape args passed to tilelang.tensor are plain
integers because (1) and (PE_num) are not tuples; change the single-element
shapes to proper 1-tuples when creating tensors in local_finish, global_finish
and barrier by passing (1,) and (PE_num,) to tilelang.tensor (keep allocator and
types the same).
🧹 Nitpick comments (1)
examples/distributed/intranode/example_alltoall_route2x4.py (1)

262-278: Large memory footprint — document hardware requirements.

buffer_transfer alone is (8, 8, 8192, 7168) × 2 bytes = 7.0 GiB per PE with the default parameters. Combined with src, dst, and the 32 GB allocator, this example requires high-memory GPUs (e.g., H100 80 GB). Consider adding a brief note in the script or README about minimum GPU memory requirements.
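The arithmetic, for reference (assuming 2-byte elements, i.e. fp16/bf16):

```python
PE_num, M, N = 8, 8192, 7168  # default parameters in the example
elem_bytes = 2                # fp16 / bf16

# buffer_transfer has shape (PE_num, PE_num, M, N)
buffer_transfer_bytes = PE_num * PE_num * M * N * elem_bytes
print(buffer_transfer_bytes / 2**30)  # 7.0 (GiB per PE)
```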

Comment on lines +276 to +278
local_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
global_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
barrier = tilelang.tensor((PE_num), torch.int32, allocator=allocator).zero_()
⚠️ Potential issue | 🟡 Minor

Shape arguments (1) and (PE_num) are plain integers, not tuples.

In Python, (1) evaluates to 1 and (PE_num) evaluates to PE_num. The tilelang.tensor function expects a shape tuple. Use trailing commas to create proper 1-element tuples:

-    local_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
-    global_finish = tilelang.tensor((1), torch.uint32, allocator=allocator).fill_(0)
-    barrier = tilelang.tensor((PE_num), torch.int32, allocator=allocator).zero_()
+    local_finish = tilelang.tensor((1,), torch.uint32, allocator=allocator).fill_(0)
+    global_finish = tilelang.tensor((1,), torch.uint32, allocator=allocator).fill_(0)
+    barrier = tilelang.tensor((PE_num,), torch.int32, allocator=allocator).zero_()
🤖 Prompt for AI Agents
In `@examples/distributed/intranode/example_alltoall_route2x4.py` around lines 276
- 278, The shape args passed to tilelang.tensor are plain integers because (1)
and (PE_num) are not tuples; change the single-element shapes to proper 1-tuples
when creating tensors in local_finish, global_finish and barrier by passing (1,)
and (PE_num,) to tilelang.tensor (keep allocator and types the same).
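A parenthesized expression without a trailing comma is not a tuple in Python; only the comma makes one:

```python
PE_num = 8
print(type((1)))      # <class 'int'>   -- just the integer 1
print(type((1,)))     # <class 'tuple'> -- a 1-element tuple
print((PE_num) == 8)  # True: plain int, parentheses are grouping only
print((PE_num,))      # (8,) -- a valid 1-element shape tuple
```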
